Sales trend analysis Report

Analyzing sales and revenue for REC corp LTD.
Author
Affiliation
Published

February 21, 2023

1 Introduction:

REC Corp LTD. is a small-scale business venture located in India that has been selling four different sales for over ten years. The company has collected sales data from their retail centers, which has been organized into a small csv file. The CEO of REC Corp LTD. has tasked me with analyzing this data to answer several important questions about their business.

1.1 Project Description:

The aim of this project is to analyze the sales data of REC Corp LTD. and provide valuable insights into their business. My analysis will help the company make informed decisions about their sales and improve their overall performance. I will use statistical and machine learning techniques to provide answers to the following questions:

  1. Is there any trend in the sales of all four sales during certain months?

  2. Out of all four sales, which product has seen the highest sales in all the given years?

  3. What would be the estimate on the number of units of each product sold on the 31st of December every year if all the retail centers were kept open?

  4. Would dropping the production of any one of the sales result in a massive setback for the company?

  5. Can we predict the sales and revenues for the year 2024 with the best possible accuracy?

By answering these questions, I will provide valuable insights into the sales patterns and overall performance of REC Corp LTD. My analysis will also help the company make informed decisions that can improve their business outcomes.

Tip

The dataset used for this project can be found here

1.2 Dataset Structure and Preparation

Before conducting any analysis on the sales data provided by REC Corp LTD., it is important to perform data cleaning to ensure the data is accurate and reliable. This involves reading the data into the R global environment, checking for duplicates, missing values, and formatting the columns to the correct data types.

In the code chunk below, I will first read the data from the CSV file into R. I will then check for duplicates and missing values to ensure that my analysis is based on reliable data. Additionally, I will format the columns to their correct data types, such as dates and numeric values, to make them consistent with the expected data types for the analysis.

This data cleaning process is crucial as it ensures that our analysis is based on accurate and reliable data. By cleaning the data, I can avoid errors and inconsistencies in my analysis, which could lead to incorrect conclusions and decisions.

# read data from disk and call it sales
# also clean variable names into consistent names
sales <- read_csv("data/statsfinal.csv", show_col_types = F) %>% clean_names()
New names:
• `` -> `...1`
# print first 6 rows
DT::datatable(head(sales))
# check duplicates 
anyDuplicated(sales)
[1] 0
# Apply a data transformation step that is preparing the dataset for further analysis by renaming the variables to more intuitive names and converting the date column to a date data type.

sales <- sales %>% 
  # rename variables to intuitive names 
  rename(
    id = x1,
    quantity_p1 = q_p1,
    quantity_p2 = q_p2,
    quantity_p3 = q_p3,
    quantity_p4 = q_p4,
    revenue_p1 = s_p1,
    revenue_p2 = s_p2,
    revenue_p3 = s_p3,
    revenue_p4 = s_p4
    ) %>% 
  # format date column to date data type
  mutate(date = as.Date(date, format = "%d-%m-%Y"))

DT::datatable(head(sales))
# check missing values 
any_na(sales)
[1] TRUE

After checking for missing values in the sales data, I found some missing values in the date column. Upon closer inspection, I discovered discovered that the missing values were associated with invalid dates. Such dates include September 31st, November 31st, etc. Since these are not valid dates, the observations associated with them should be considered as errors and should be excluded from the analysis. These observations could be the result of data entry errors or data quality issues. Fortunately, the percentage of these missing values is only 0.06%.

# remove observations with missing values 
sales <- sales %>% filter(!is.na(date))

# confirm removal
any_na(sales)
[1] FALSE

2 Is there any trend in the sales of all four sales during certain months?

To answer this question, I will analyze the combined sales figures for the four sales over the given time period. By analyzing the combined data, I can identify any consistent trends or patterns in the sales figures that apply to all four sales.

sales %>% 
  mutate(
    # create a variable(combined_sales) that sums the total sales unit for every observation
    combined_sales = quantity_p1 + quantity_p2 + quantity_p3 + quantity_p4,
    # derive the month variable from the date column
    month = lubridate::month(date, label = F),
    # derive the year variable from date column
    year  = lubridate::year(date)
  ) %>% 
  # group the observations by the year and month variable
  group_by(year, month) %>% 
  # calculate the mean combined sales for every year and month 
  summarise(avg_combined_sales = mean(combined_sales), .groups = "drop") %>% 
  # create a bar plot of average combined sales by month
  ggplot(aes(as.factor(month), avg_combined_sales, fill = "red", alpha = .5))+
  # removed the unnecessary legend made by geom_col
  geom_col(show.legend = F)+
  # facet by year and let each plot have its own x axis
  facet_wrap(~year)+
  theme_bw()+
  labs(
    title = "Average combined sales by month per year",
    subtitle = "Plot Showing Average sales across all four products by month for all business years",
    caption = "Data source: https://www.kaggle.com/datasets/ksabishek/product-sales-data",
    y = "Average sales units",
    x = "Month"
  )+
  theme(
    plot.background = element_rect(fill = "gray92")
  )

# combined sales units over time
sales %>% 
  mutate(
    # create a variable(combined_sales) that sums the total sales unit for every observation
    combined_sales = quantity_p1 + quantity_p2 + quantity_p3 + quantity_p4,
    # derive the month variable from the date column
    month = lubridate::month(date, label = F),
    # derive the year variable from date column
    year  = lubridate::year(date)
  ) %>% 
  timetk::plot_time_series(
    .date_var = date, 
    .value = combined_sales,
    .title = "Combined Sales units over time"
    )

2.1 Insights generated

After conducting an analysis of the sales data for REC Corp LTD.’s four products, I have found that there is no discernible pattern or trend in the sales for certain months, as shown in the two plots above. This suggests that the sales of these products are not significantly impacted by seasonality or other factors that would cause sales to increase or decrease in specific months.

While it may be disappointing not to find any clear trend or pattern, this insight is still valuable to REC Corp LTD. as it highlights that their sales performance is relatively stable and not overly reliant on particular seasons or other external factors. This insight can inform the company’s business strategy and help them to plan for the future more effectively.

Since there is no clear trend or pattern in the sales, the company can focus on other factors that may impact sales performance, such as product quality, pricing, marketing, and customer experience. This insight can help REC Corp LTD. to continue to grow and improve their business by focusing on these other factors that may impact their sales.

3 Out of all four sales, which product has seen the highest sales in all the given years?

The second question that we are trying to answer is which of the four products has seen the highest sales in all the given years. By identifying the product that generates the highest revenue, REC Corp LTD. can better understand their customer’s preferences and tailor their product strategy accordingly. In addition, knowing which product performs the best can help the company allocate their resources more effectively and make informed business decisions that drive growth and profitability. In this section, I will analyze the sales data and determine which of the four products has seen the highest sales over the given years.

total_sales <- sales %>% 
  # calculate the sum of sales of all products over the whole span
  summarise(
    product_1 = sum(quantity_p1),
    product_2 = sum(quantity_p2),
    product_3 = sum(quantity_p3),
    product_4 = sum(quantity_p4)
  ) %>% 
  pivot_longer(
    everything(), names_to = "product", values_to = "total_sales"
  )

# print total sales
DT::datatable(total_sales)
# create a bar plot to visualize the result
total_sales %>%  # use the total_sales data frame
  ggplot(
    aes(x = fct_reorder(product, total_sales), # reorder products by total sales
        y = total_sales, fill = "red", alpha = .5  # set fill color and transparency
      )
    )+
  geom_col(show.legend = F)+  # plot a column chart, turn off legend
  geom_text(  # add text labels to each column
    aes(
      label = format(total_sales, big.mark = ","),  # format sales numbers with commas
      fontface = "bold",  # set text to bold
      hjust = 1  # align text to the right of the column
        )
    )+
  coord_flip()+  # flip x and y axes
  theme_bw()+  # set plot theme to black and white
  labs(
    title = "Total sales units by product",  # add title and subtitle
    subtitle = "Plot showing the overall sales for each product for all business years",
    y = "Total Sales",  # label y axis
    x = "Product"  # label x axis
  )+
  theme(  # adjust plot theme settings
    plot.background = element_rect(fill = "gray92"),  # set plot background color
    axis.text.x = element_blank(),  # hide x axis tick labels
    axis.ticks.x = element_blank(),  # hide x axis ticks
    legend.position = "none",  # hide legend,
    axis.text = element_text(face = "bold.italic") # set axis text to bold & italic
  )

3.1 Insights Generated

Based on my analysis of the sales data for REC Corp LTD.’s four products, I have found that product 1 has seen the highest sales in all the given years, with 18,860,169 units sold. This is followed by product 3 with 14,379,603 units sold, then product 2 with 9,741,271 units sold, and finally product 4 with 5,139,979 units sold.

This insight provides valuable information for REC Corp LTD. as it highlights that products 1 and 3 are performing better than others by a big margin. By focusing on these top-performing products, the company can invest more resources in these products and potentially generate even more sales and revenue. Additionally, this information can help the company to identify areas where they need to improve the performance of their lower-performing products (product 2 and product 4) or potentially even consider discontinuing these products if they do not align with the company’s goals.

Overall, the insight gained from this analysis can help REC Corp LTD. to make informed decisions about their product strategy and resource allocation, which can drive growth and profitability for the company.

4 What would be the estimate on the number of units of each product sold on the 31st of December every year if all the retail centers were kept open?

To estimate the sales for December 31st, I will calculate the average sales for all the Decembers where data is available and use that average as an estimate for the closed day. This approach assumes that the sales pattern for the closed day is similar to the sales pattern for the rest of the month. However, this approach has some limitations, as it doesn’t take into account any special events or changes in customer behavior that may affect the sales on the closed day.

# make the table output interactive
DT::datatable(sales %>%
    mutate(
      # Create a new variable 'month' using the date column
      month = lubridate::month(date, label = T)
    ) %>% 
    # Filter to only keep observations where the month is December
    filter(month == "Dec") %>% 
    # Group the data by year
    group_by(year = lubridate::year(date)) %>% 
    # calculate the average quantity sold for each product
    summarise(
      product1_estimate = round(mean(quantity_p1),1),
      product2_estimate = round(mean(quantity_p2),1),
      product3_estimate = round(mean(quantity_p3),1),
      product4_estimate = round(mean(quantity_p4),1)
      )
    )

4.1 Point to Note

Based on the analysis of the available data, the table above shows the estimated number of units that could have been sold on December 31st for each of the four products. It is important to note that due to the retail centers being closed on that day, these estimates are purely theoretical and should be taken with caution. Nonetheless, this information could be useful in forecasting sales and optimizing inventory levels in preparation for the holiday season.

Important

Further research could be done to identify patterns in sales leading up to December 31st to better inform the above estimates.

5 Can we predict the sales and revenues for the year 2024 with the best possible accuracy?

# subset sales to get id, datetime feature and value for product 1
product_1 <- sales %>% select(date, quantity_p1)

# visulize the sales of product 1 over time 
product_1 %>% 
  plot_time_series(
    .date_var = date, 
    .value = quantity_p1,
    .title = "Product 1 Sales Over time"
    )
# I want to create forecast models to predict the sales for the year 2024, 
# this means I will be forecasting for the next 23 months from the end of my time series 

# Split data 
splits <- time_series_split(
  product_1, 
  assess = "6 months", # split the last 6 months of the data
  cumulative = TRUE
  )
Using date_var: date
# visualize the splits
splits %>% 
  tk_time_series_cv_plan() %>% # extract time series cross validation plan (cv)
  plot_time_series_cv_plan(.date_var = date, .value = quantity_p1) # plot cv

In the code above, the time series data for product_1 is split into training and testing sets using the time_series_split() function. The splitting is done to evaluate the accuracy of the forecast models.

The assess argument in time_series_split() specifies that the last 6 months of the data will be used as the testing set. This means that the models will be trained on the first part of the time series data and evaluated on the last 6 months.

The cumulative argument is set to TRUE, which means that the training data used to fit the model includes all data prior to each training set. This is useful for time series data because the model can learn from the patterns in the historical data as it progresses through the training sets. Lets now fit some forecasting models.

5.1 Modeltime Workflow

The modeltime workflow is a framework for building, evaluating, and comparing multiple time series models. It allows for easy comparison of different types of models and their respective accuracy. The workflow consists of several stages:

  1. Model definition: choosing and defining the models to be used in the analysis

  2. Model fitting: fitting the models to the training data

    This is achieved using the code below

# create forecast models

# 1. fit auto arima
model_arima <- arima_reg() %>% 
  set_engine("auto_arima") %>% 
  fit(quantity_p1 ~ date, training(splits))
frequency = 7 observations per 1 week
# 2. fit prophet model
model_prophet <- prophet_reg(seasonality_yearly = TRUE) %>% 
  set_engine("prophet") %>% 
  fit(quantity_p1 ~ date, training(splits))
Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
# 3. fit glmnet---machine learing model
model_glmnet <- linear_reg(penalty = 0.01) %>% # penalized regression
  set_engine("glmnet") %>% 
  fit(                                        # expand features
    quantity_p1 ~ wday(date, label = TRUE)
                + month(date, label = TRUE)
                + as.numeric(date),
    training(splits)
    )
Model Description
  1. Auto ARIMA: This is an automated version of the ARIMA model which uses a stepwise approach to search for the best parameters for the model. It works by identifying the optimal order of differencing (d), the order of the autoregressive term (p), and the order of the moving average term (q). This model can capture both trend and seasonality in the data.
  1. Prophet: This is a time series forecasting model developed by Facebook. It is designed to capture seasonal trends, as well as changes in trend over time, using a piecewise linear model with a flexible set of priors that can be set by the user. It also allows the inclusion of external variables that might influence the time series, such as holidays or other special events.

  2. GLMNET: This is a regularized regression model which combines Lasso and Ridge regression, allowing for feature selection and shrinking the coefficients of the model. In this code, the model uses the day of the week, month, and date features to predict sales.

  1. Model calibration: calibrating the models and calculating their predictions and residuals for the test data
# organize models for modeltime workflow
model_tbl <- modeltime_table(
  model_arima,
  model_prophet,
  model_glmnet
)

# perform calibration (calculate predictions and residuals for the test data)
calib_tbl <- model_tbl %>% 
  modeltime_calibrate(testing(splits))
  1. Model evaluation: evaluating the accuracy of the models on the test data
# Test accuracy of our test set predictions
calib_tbl %>% modeltime_accuracy()
.model_id .model_desc .type mae mape mase smape rmse rsq
1 ARIMA(0,0,0) WITH NON-ZERO MEAN Test 2039.472 130.2198 0.8061685 57.80697 2298.768 NA
2 PROPHET Test 2066.254 130.0835 0.8167548 58.46905 2329.875 0.0212375
3 GLMNET Test 2056.347 129.5668 0.8128388 58.26103 2314.525 0.0085526
  1. Model selection: selecting the best performing model or combination of models.

The RMSE values for the three models are quite close, with the Arima model having the lowest RMSE, followed closely by GLMNET and prophet with the highest RMSE. This indicates that the ARIMA model performed slightly better than the other models in predicting future sales for product 1. However, it’s important to note that the difference in RMSE between the three models is not very large, and it’s possible that another model might perform better depending on the data and the specific business problem being addressed. It’s always a good idea to consider the strengths and weaknesses of each model before selecting the best one for a particular application.

  1. Forecasting: using the selected model(s) to forecast future data.

After calibrating the models with the test set and checking their accuracy, the next stage is to use the models to make predictions for the future. In this case, the models are used to forecast the sales for Product 1 for the next 23 months beyond the end of the time series data.

The modeltime_refit() function is used to refit the models on the entire dataset (training and testing sets combined) to improve their accuracy before making the forecasts. The modeltime_forecast() function is then used to generate forecasts for the next 23 months. The actual sales data for Product 1 is also provided to the function to compare the predicted sales with the actual sales.

Finally, the plot_modeltime_forecast() function is used to visualize the forecasts along with the actual sales data. The resulting plot shows the predicted values for Product 1 sales over the next 23 months as well as the upper and lower bounds of the confidence interval. This information can be used to inform decision-making and planning for the future sales of Product 1.

# visualize test set predictions 
calib_tbl %>% modeltime_forecast(
    new_data = testing(splits),
    actual_data = product_1
  ) %>% 
  plot_modeltime_forecast()
# forecast the future
future_forecast_tbl <- calib_tbl %>% 
  modeltime_refit(product_1) %>% 
  modeltime_forecast(
    h = "23 months",
    actual_data = product_1
  )
frequency = 7 observations per 1 week
Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
# visualize future forecast
future_forecast_tbl %>% 
  plot_modeltime_forecast()